With increasing economic challenges and daily financial pressures, the demand for loans continues to rise. Financial institutions receive a large number of loan applications each day, making it difficult to manage them, particularly when evaluations are done manually and there are no reliable methods to assess a candidate’s creditworthiness (Mnkandla et al., 2024). To simplify the process for both applicants and financial institutions, it is essential to develop robust evaluation methods that can accurately predict a loan applicant’s creditworthiness while minimizing risks for lenders. To achieve this, we explore an existing dataset to identify patterns across various features of loan applicants and determine which factors have the greatest impact on loan approvals. Once these key features are identified, predictive models can be built around them, reducing the need for manual evaluation and enabling data-driven decision-making based on actual application data.
To investigate the factors that influence loan decisions, we obtained a dataset from Kaggle (Sharma, 2023) containing various variables for each loan application, such as income, education, CIBIL score, and the final loan status (Approved or Rejected). Our goal is to examine whether each of these variables has any impact on the loan decision. We formulated roughly ten SMART questions, each exploring the potential effect of a variable on the final outcome. The insights gained from this exploratory analysis will guide the development of predictive models capable of accurately determining whether a loan should be approved for a given applicant.
# reading dataset (read.csv already returns a data.frame)
loan_df <- read.csv("loan_approval_dataset.csv")
# exploring dataset
str(loan_df)
## 'data.frame': 4269 obs. of 13 variables:
## $ loan_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ no_of_dependents : int 2 0 3 3 5 0 5 2 0 5 ...
## $ education : chr " Graduate" " Not Graduate" " Graduate" " Graduate" ...
## $ self_employed : chr " No" " Yes" " No" " No" ...
## $ income_annum : int 9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
## $ loan_amount : int 29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
## $ loan_term : int 12 8 20 8 20 10 4 20 20 10 ...
## $ cibil_score : int 778 417 506 467 382 319 678 382 782 388 ...
## $ residential_assets_value: int 2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
## $ commercial_assets_value : int 17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
## $ luxury_assets_value : int 22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
## $ bank_asset_value : int 8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
## $ loan_status : chr " Approved" " Rejected" " Rejected" " Rejected" ...
The dataset has 4,269 observations of 13 variables. We remove the loan_id column, since it only serves as a unique identifier and does not contribute to the analysis. We also convert several variables to factors, as they represent categorical information rather than continuous numerical values.
loan_df <- subset(loan_df, select = -c(loan_id))
# converting the categorical variables (stored as int/chr) to factors
loan_df$no_of_dependents <- as.factor(loan_df$no_of_dependents)
loan_df$education <- as.factor(loan_df$education)
loan_df$self_employed <- as.factor(loan_df$self_employed)
loan_df$loan_status <- as.factor(loan_df$loan_status)
str(loan_df)
## 'data.frame': 4269 obs. of 12 variables:
## $ no_of_dependents : Factor w/ 6 levels "0","1","2","3",..: 3 1 4 4 6 1 6 3 1 6 ...
## $ education : Factor w/ 2 levels " Graduate"," Not Graduate": 1 2 1 1 2 1 1 1 1 2 ...
## $ self_employed : Factor w/ 2 levels " No"," Yes": 1 2 1 1 2 2 1 2 2 1 ...
## $ income_annum : int 9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
## $ loan_amount : int 29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
## $ loan_term : int 12 8 20 8 20 10 4 20 20 10 ...
## $ cibil_score : int 778 417 506 467 382 319 678 382 782 388 ...
## $ residential_assets_value: int 2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
## $ commercial_assets_value : int 17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
## $ luxury_assets_value : int 22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
## $ bank_asset_value : int 8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
## $ loan_status : Factor w/ 2 levels " Approved"," Rejected": 1 2 2 2 2 2 1 2 1 2 ...
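One quirk visible in the output above: every character value carries a leading space inherited from the CSV (e.g. " Approved", " Graduate"). The analysis below keeps these padded levels as-is, which is why some later comparisons wrap values in trimws(). A hedged sketch of stripping the padding once at import, shown on a two-row synthetic stand-in rather than the real file:

```r
# Sketch: trim leading/trailing whitespace from every character column,
# so later filters could compare against "Approved" instead of " Approved".
# loan_raw is a synthetic stand-in for the freshly read CSV.
loan_raw <- data.frame(
  education   = c(" Graduate", " Not Graduate"),
  loan_status = c(" Approved", " Rejected"),
  stringsAsFactors = FALSE
)
char_cols <- vapply(loan_raw, is.character, logical(1))
loan_raw[char_cols] <- lapply(loan_raw[char_cols], trimws)
loan_raw$loan_status  # "Approved" "Rejected"
```

Applying this right after read.csv would also make the factor levels cleaner when converting with as.factor.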
After conversion, we have the following variables in the dataset:
library(knitr)
# Variable description table
loan_data <- data.frame(
Variable = c("no_of_dependents", "education", "self_employed", "income_annum", "loan_amount", "loan_term", "cibil_score", "residential_assets_value", "commercial_assets_value", "luxury_assets_value", "bank_asset_value", "loan_status"),
Description = c("Number of dependents an applicant has ranging from 0 to 5", "Education level (Graduate or Not Graduate)", "Whether the applicant works independently or for an employer", "Annual income of applicant", "Amount of loan requested", "The number of years in which the applicant will repay the loan","Credit score indicating applicant's history of repayment", "Value of applicant's residential assets if any", "Value of applicant's commercial assets if any", "Value of applicant's luxury assets if any", "Value of applicant's bank assets if any", "Target variable indicating whether the loan was Approved or Rejected"),
Type = c("Categorical", "Categorical", "Categorical", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Categorical")
)
# Create table
kable(loan_data)
| Variable | Description | Type |
|---|---|---|
| no_of_dependents | Number of dependents an applicant has ranging from 0 to 5 | Categorical |
| education | Education level (Graduate or Not Graduate) | Categorical |
| self_employed | Whether the applicant works independently or for an employer | Categorical |
| income_annum | Annual income of applicant | Numeric |
| loan_amount | Amount of loan requested | Numeric |
| loan_term | The number of years in which the applicant will repay the loan | Numeric |
| cibil_score | Credit score indicating applicant's history of repayment | Numeric |
| residential_assets_value | Value of applicant’s residential assets if any | Numeric |
| commercial_assets_value | Value of applicant’s commercial assets if any | Numeric |
| luxury_assets_value | Value of applicant’s luxury assets if any | Numeric |
| bank_asset_value | Value of applicant’s bank assets if any | Numeric |
| loan_status | Target variable indicating whether the loan was Approved or Rejected | Categorical |
sum(is.na(loan_df))
## [1] 0
There are no NA values in the dataset.
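Had the sum been nonzero, a per-column count would be more informative than the overall total. A small sketch on a synthetic stand-in (loan_df itself has no NAs):

```r
# colSums(is.na(...)) reports missing values per column rather than in total.
# toy is a synthetic stand-in with one NA in each column.
toy <- data.frame(
  income_annum = c(1e6, NA, 3e6),
  cibil_score  = c(700, 650, NA)
)
colSums(is.na(toy))  # income_annum: 1, cibil_score: 1
```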
summary(loan_df)
## no_of_dependents education self_employed income_annum
## 0:712 Graduate :2144 No :2119 Min. : 200000
## 1:697 Not Graduate:2125 Yes:2150 1st Qu.:2700000
## 2:708 Median :5100000
## 3:727 Mean :5059124
## 4:752 3rd Qu.:7500000
## 5:673 Max. :9900000
## loan_amount loan_term cibil_score residential_assets_value
## Min. : 300000 Min. : 2.0 Min. :300 Min. : -100000
## 1st Qu.: 7700000 1st Qu.: 6.0 1st Qu.:453 1st Qu.: 2200000
## Median :14500000 Median :10.0 Median :600 Median : 5600000
## Mean :15133450 Mean :10.9 Mean :600 Mean : 7472617
## 3rd Qu.:21500000 3rd Qu.:16.0 3rd Qu.:748 3rd Qu.:11300000
## Max. :39500000 Max. :20.0 Max. :900 Max. :29100000
## commercial_assets_value luxury_assets_value bank_asset_value
## Min. : 0 Min. : 300000 Min. : 0
## 1st Qu.: 1300000 1st Qu.: 7500000 1st Qu.: 2300000
## Median : 3700000 Median :14600000 Median : 4600000
## Mean : 4973155 Mean :15126306 Mean : 4976692
## 3rd Qu.: 7600000 3rd Qu.:21700000 3rd Qu.: 7100000
## Max. :19400000 Max. :39200000 Max. :14700000
## loan_status
## Approved:2656
## Rejected:1613
##
##
##
##
Overall, the dataset appears well-structured and balanced.
Categorical variables such as education, self_employed, and loan_status show nearly
even distributions across their categories. Most
numerical variables fall within reasonable ranges, with
average loan terms around 11 years and CIBIL scores
centered near 600, both indicating realistic applicant
profiles. However, some financial variables like income, loan amount,
and asset values show wide variation and possible outliers. In
particular, the presence of negative values in
residential_assets_value points to potential data quality
issues that will need correction before exploration or testing.
We check how many rows have negative values in the residential_assets_value column.
sum(loan_df$residential_assets_value < 0)
## [1] 28
There are 28 entries with negative values.
loan_df <- loan_df[loan_df$residential_assets_value >= 0, ]
str(loan_df)
## 'data.frame': 4241 obs. of 12 variables:
## $ no_of_dependents : Factor w/ 6 levels "0","1","2","3",..: 3 1 4 4 6 1 6 3 1 6 ...
## $ education : Factor w/ 2 levels " Graduate"," Not Graduate": 1 2 1 1 2 1 1 1 1 2 ...
## $ self_employed : Factor w/ 2 levels " No"," Yes": 1 2 1 1 2 2 1 2 2 1 ...
## $ income_annum : int 9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
## $ loan_amount : int 29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
## $ loan_term : int 12 8 20 8 20 10 4 20 20 10 ...
## $ cibil_score : int 778 417 506 467 382 319 678 382 782 388 ...
## $ residential_assets_value: int 2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
## $ commercial_assets_value : int 17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
## $ luxury_assets_value : int 22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
## $ bank_asset_value : int 8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
## $ loan_status : Factor w/ 2 levels " Approved"," Rejected": 1 2 2 2 2 2 1 2 1 2 ...
summary(loan_df)
## no_of_dependents education self_employed income_annum
## 0:706 Graduate :2127 No :2106 Min. : 200000
## 1:696 Not Graduate:2114 Yes:2135 1st Qu.:2700000
## 2:701 Median :5100000
## 3:725 Mean :5074251
## 4:744 3rd Qu.:7500000
## 5:669 Max. :9900000
## loan_amount loan_term cibil_score residential_assets_value
## Min. : 300000 Min. : 2.0 Min. :300 Min. : 0
## 1st Qu.: 7700000 1st Qu.: 6.0 1st Qu.:453 1st Qu.: 2200000
## Median :14600000 Median :10.0 Median :600 Median : 5700000
## Mean :15178401 Mean :10.9 Mean :600 Mean : 7522613
## 3rd Qu.:21500000 3rd Qu.:16.0 3rd Qu.:747 3rd Qu.:11400000
## Max. :39500000 Max. :20.0 Max. :900 Max. :29100000
## commercial_assets_value luxury_assets_value bank_asset_value
## Min. : 0 Min. : 300000 Min. : 0
## 1st Qu.: 1300000 1st Qu.: 7500000 1st Qu.: 2400000
## Median : 3700000 Median :14600000 Median : 4600000
## Mean : 4985121 Mean :15171210 Mean : 4991488
## 3rd Qu.: 7700000 3rd Qu.:21700000 3rd Qu.: 7100000
## Max. :19400000 Max. :39200000 Max. :14700000
## loan_status
## Approved:2640
## Rejected:1601
##
##
##
##
After filtering out the rows with negative values, the dataset now has 4,241 observations. The distributions across all variables look almost identical to before, showing that this cleaning step did not meaningfully alter the data overall.
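As a quick check on that claim, the approval mix can be recomputed from the loan_status counts reported in the two summaries; a small self-contained sketch:

```r
# Approval shares before and after dropping the 28 negative-asset rows,
# using the loan_status counts from the summary() outputs above.
before <- c(Approved = 2656, Rejected = 1613)   # n = 4269
after  <- c(Approved = 2640, Rejected = 1601)   # n = 4241
round(prop.table(before), 3)  # Approved 0.622, Rejected 0.378
round(prop.table(after), 3)   # Approved 0.622, Rejected 0.378
```

The approval rate is unchanged to three decimal places, so the filter did not bias the target variable.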
Here, we aim to examine whether the person’s education (Graduate or Not Graduate) affects the approval or rejection of a loan.
We begin by visualizing the distribution of approved and rejected
applications for each education group using a bar chart.
library(ggplot2)
ggplot(loan_df, aes(x = loan_status, fill = education)) +
geom_bar(position = "dodge") +
labs(
title = "Loan Approval by Education Level",
x = "Loan Status",
y = "Number of Applicants"
) +
scale_fill_manual(
values = c(" Graduate" = "#FF7C61", " Not Graduate" = "#50E5C8")
) +
theme_minimal()
Approval rates are nearly identical for Graduate and Not Graduate applicants, suggesting education likely doesn’t affect loan approval. A Chi-square test will confirm this statistically.
edu_table <- table(loan_df$education, loan_df$loan_status)
print(edu_table)
##
## Approved Rejected
## Graduate 1329 798
## Not Graduate 1311 803
edu_chi <- chisq.test(edu_table)
print(edu_chi)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: edu_table
## X-squared = 0.08, df = 1, p-value = 0.8
The Chi-square test yielded a p-value of
0.8. Since this is greater than 0.05, we fail
to reject the null hypothesis, indicating that there is no significant
association between education level and loan approval. This confirms
that, in this dataset, education does not appear to affect the
likelihood of a loan being approved.
Here, we aim to examine whether the value of a person’s assets, including residential, commercial, luxury, and bank assets, individually affects the approval or rejection of a loan. We begin by visualizing the frequency distribution of each type of asset value.
# arrange the four histograms in a 2x2 grid
par(mfrow = c(2, 2))
# 1. Residential Assets Histogram
hist(loan_df$residential_assets_value,
main = "Distribution of Residential Assets Value",
xlab = "Residential Assets Value",
col = "lightblue",
border = "black")
# 2. Commercial Assets Histogram
hist(loan_df$commercial_assets_value,
main = "Distribution of Commercial Assets Value",
xlab = "Commercial Assets Value",
col = "lightgreen",
border = "black")
# 3. Luxury Assets Histogram
hist(loan_df$luxury_assets_value,
main = "Distribution of Luxury Assets Value",
xlab = "Luxury Assets Value",
col = "pink",
border = "black")
# 4. Bank Assets Histogram
hist(loan_df$bank_asset_value,
main = "Distribution of Bank Assets Value",
xlab = "Bank Assets Value",
col = "salmon",
border = "black")
# Reset plotting layout to default (1 plot per screen) after generation
par(mfrow = c(1, 1))
Each type of asset value shows a right-skewed distribution, meaning most individuals hold lower asset values while a smaller number possess much higher ones, indicating non-normality. To statistically verify this, we can apply the Shapiro–Wilk test, which checks whether the data follow a normal distribution.
shapiro.test(loan_df$residential_assets_value)
##
## Shapiro-Wilk normality test
##
## data: loan_df$residential_assets_value
## W = 0.9, p-value <2e-16
shapiro.test(loan_df$commercial_assets_value)
##
## Shapiro-Wilk normality test
##
## data: loan_df$commercial_assets_value
## W = 0.9, p-value <2e-16
shapiro.test(loan_df$luxury_assets_value)
##
## Shapiro-Wilk normality test
##
## data: loan_df$luxury_assets_value
## W = 1, p-value <2e-16
shapiro.test(loan_df$bank_asset_value)
##
## Shapiro-Wilk normality test
##
## data: loan_df$bank_asset_value
## W = 1, p-value <2e-16
Since all p-values are far below 0.05,
we reject the null hypothesis for each test. This
indicates that none of the asset value distributions are normally
distributed, which is consistent with the right-skewed patterns observed
in the histograms. Let’s visualize the impact of asset values on loan
status using a box plot.
# Convert to long format (requires tidyr for pivot_longer and dplyr for %>%)
library(tidyr)
library(dplyr)
loan_long <- loan_df %>%
pivot_longer(
cols = c(residential_assets_value, commercial_assets_value, luxury_assets_value, bank_asset_value),
names_to = "asset_type",
values_to = "asset_value"
)
# Boxplots all in one figure
ggplot(loan_long, aes(x = asset_type, y = asset_value, fill = loan_status)) +
geom_boxplot(position = position_dodge(width = 0.8)) +
labs(
title = "Asset Distributions by Loan Status",
x = "Asset Type",
y = "Asset Value"
) +
theme_minimal() +
scale_fill_manual(values = c(" Approved" = "#1b9e77", " Rejected" = "#d95f02"))
The median and range of each asset
type are similar between
Approved and Rejected
loans, suggesting little apparent effect on loan decisions. Since the
Shapiro-Wilk test showed that asset values are not normally distributed,
we use the Wilcoxon rank-sum test to formally assess differences.
wilcox.test(residential_assets_value ~ loan_status, data = loan_df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: residential_assets_value by loan_status
## W = 2e+06, p-value = 0.2
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(commercial_assets_value ~ loan_status, data = loan_df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: commercial_assets_value by loan_status
## W = 2e+06, p-value = 0.7
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(luxury_assets_value ~ loan_status, data = loan_df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: luxury_assets_value by loan_status
## W = 2e+06, p-value = 0.2
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(bank_asset_value ~ loan_status, data = loan_df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: bank_asset_value by loan_status
## W = 2e+06, p-value = 0.4
## alternative hypothesis: true location shift is not equal to 0
All p-values are greater than 0.05, so
we fail to reject the null hypothesis for each asset
type. This indicates that there is no statistically significant
difference in asset values between Approved and Rejected loans,
supporting the initial observation that asset values do not
appear to affect loan approval decisions.
Here, we aim to examine whether the person’s annual income affects the approval or rejection of a loan. We begin by visualizing the frequency distribution of annual income.
hist(loan_df$income_annum,
main = "Distribution of Annual income",
xlab = "Annual income",
col = "salmon",
border = "black")
The frequency distribution shows that annual income is roughly
uniform and symmetric, with no noticeable skew or
extreme outliers. We can now examine the distribution of annual
income for
Approved and Rejected applicants
using a box plot.
ggplot(loan_df, aes(x = loan_status, y = income_annum, fill = loan_status)) +
geom_boxplot() +
labs(title = "Annual Income Distribution by Loan Status", x = "Loan Status", y = "Annual Income") +
scale_fill_manual(values = c(" Approved" = "#12362A", " Rejected" = "#8DD9C1"))
The box plot shows that the income distributions for both
Approved and Rejected applicants are quite
similar. The median income for both groups is
approximately 5 million, and the small difference in the
lower quartiles appears negligible. To statistically verify this
observation, we will perform a two-sample t-test to compare the means of
the two groups.
Null Hypothesis (H₀): There is no significant difference in the average annual income between approved and rejected applicants.
Alternative Hypothesis (H₁): There is a significant difference in the average annual income between approved and rejected applicants.
approved_income <- loan_df$income_annum[loan_df$loan_status == " Approved"]
rejected_income <- loan_df$income_annum[loan_df$loan_status == " Rejected"]
t.test(approved_income, rejected_income)
##
## Welch Two Sample t-test
##
## data: approved_income and rejected_income
## t = -1, df = 3435, p-value = 0.2
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -277421 68867
## sample estimates:
## mean of x mean of y
## 5034886 5139163
Since the p-value (0.2) is greater than 0.05, we fail to reject the null hypothesis, indicating that there is no statistically significant difference in mean income between Approved and Rejected applicants. The data therefore supports the earlier observation from the box plot: annual income does not appear to affect loan approval in this dataset.
Here, we aim to examine whether the loan amount requested is influenced by an applicant’s annual income; understanding this relationship helps identify potential multicollinearity. Since we already know that annual income is roughly uniform and symmetric with no skew, we now visualize the distribution of the requested loan amount.
hist(loan_df$loan_amount,
main = "Distribution of Loan amount requested",
xlab = "Loan amount requested",
col = "yellow",
border = "black")
The distribution of loan amount is right-skewed and deviates from
normality. Therefore, we will first visualize the relationship between
annual income and the requested loan amount, and then use Spearman’s
correlation test, as it does not assume normality.
annual_income <- loan_df$income_annum
loan_amount <- loan_df$loan_amount
ggplot(data = loan_df, aes(x = loan_amount, y = income_annum)) +
geom_point(color = "#F7AC19") +
labs(
title = "Scatter plot of Annual Income vs Loan Amount",
x = "Loan Amount",
y = "Annual Income"
) +
theme_minimal()
A positive correlation is evident, indicating that applicants with higher annual incomes tend to apply for and receive larger loan amounts. We can use a Spearman correlation test to verify this.
Null Hypothesis (H₀): There is no relationship between annual income and loan amount among applicants.
Alternative Hypothesis (H₁): Applicants with higher annual incomes tend to apply for and receive larger loan amounts.
cor.test(annual_income, loan_amount, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: annual_income and loan_amount
## S = 8e+08, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.941
The Spearman’s rank correlation test shows a very strong positive
correlation between annual income and loan amount
(ρ = 0.941, p < 0.001). This indicates that as
applicants’ annual income increases, the loan amount they apply for and
receive also tends to increase. Since the p-value is far below
0.05, we reject the null hypothesis and conclude that there
is a significant positive relationship between annual income and loan
amount. This strong association suggests that the two variables convey
similar information.
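This point generalizes: a rank-correlation matrix across all numeric predictors would flag any other pairs that convey overlapping information. A hedged sketch on synthetic stand-in columns; with the real data this would be `cor(loan_df[sapply(loan_df, is.numeric)], method = "spearman")`:

```r
# Spearman correlation matrix: values near 1 flag near-redundant predictors.
# The three columns below are synthetic stand-ins, not the real dataset.
set.seed(42)
income <- runif(200, 2e5, 9.9e6)                 # mimics income_annum's range
amount <- 3 * income + rnorm(200, sd = 1e6)      # strongly tied to income
cibil  <- sample(300:900, 200, replace = TRUE)   # independent of both
round(cor(cbind(income, amount, cibil), method = "spearman"), 2)
```

In a predictive model, one member of any highly correlated pair could be dropped without much loss of information.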
Here, we are investigating whether there is a significant difference in the CIBIL scores of Approved and Rejected candidates. First, we plot the entire distribution to identify outliers; then we plot a chart showing the Approved and Rejected populations.
library(ggplot2)
#CIBIL score frequency distribution
hist(loan_df$cibil_score, main = "Frequency Distribution of CIBIL Scores", xlab = "CIBIL Score", ylab = "Frequency", breaks = 15, col = "#197571")
#CIBIL score frequency box chart
ggplot(loan_df, aes(x = loan_status, y = cibil_score, fill = loan_status)) +
geom_boxplot() +
labs(title = "CIBIL Score Distribution by Loan Status", x = "Loan Status", y = "CIBIL Score") +
scale_fill_manual(values = c(" Approved" = "#197571", " Rejected" = "#FFCF85"))
#t-test
t.test(cibil_score ~ loan_status, data = loan_df)
##
## Welch Two Sample t-test
##
## data: cibil_score by loan_status
## t = 88, df = 4238, p-value <2e-16
## alternative hypothesis: true difference in means between group Approved and group Rejected is not equal to 0
## 95 percent confidence interval:
## 268 280
## sample estimates:
## mean in group Approved mean in group Rejected
## 703 429
We use this test to confirm whether the CIBIL Scores for the Approved and Rejected applicants are statistically significantly different or the same.
The overall CIBIL score frequency distribution shows no apparent outliers. The individual distributions for approved and rejected loans show a few outliers, but it is not necessary to remove them. Even though the outliers are dragging the two means closer to each other, the means are still significantly different even when including them in the t-test.
There is a significant difference in CIBIL scores between approved
and rejected applicants because the p-value, 2e-16, is much
lower than the standard alpha threshold of 0.05. This
allows us to reject the null hypothesis that the two means of approved
and rejected applicants are equal.
Here, we are investigating whether there is a significant difference in CIBIL scores within the Rejected applicant population depending on whether they sought a shorter-term (less than or equal to 10 years) or longer-term (greater than 10 years) loan. First, we plot the entire distribution to identify outliers; then we plot a chart showing the Rejected population with shorter- and longer-term loans.
#create subsets of rejected applicants with shorter vs. longer loan terms
rejected_shorter <- subset(loan_df, trimws(loan_status) == "Rejected" & loan_term <= 10)
rejected_longer <- subset(loan_df, trimws(loan_status) == "Rejected" & loan_term > 10)
#Overall loan term frequency distribution
ggplot(loan_df, aes(x = factor(loan_term))) + geom_bar(fill = "#7B3D91") + labs(title = "Frequency Distribution of Loan Terms", x = "Loan Term (Years)", y = "Frequency")
#boxplot
boxplot(
rejected_shorter$cibil_score,
rejected_longer$cibil_score,
names = c("Shorter (<= 10 yrs)", "Longer (> 10 yrs)"),
main = "CIBIL Score Distribution for Rejected Loans",
ylab = "CIBIL Score",
xlab = "Loan Term Group",
col = c("#A769C2", "#87BDCC")
)
#removing values outside 1.5 * IQR
removeOutliers <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  x[x >= lower_bound & x <= upper_bound]
}
cleaned_shorter_cibil <- removeOutliers(rejected_shorter$cibil_score)
cleaned_longer_cibil <- removeOutliers(rejected_longer$cibil_score)
t.test(rejected_shorter$cibil_score, rejected_longer$cibil_score)
##
## Welch Two Sample t-test
##
## data: rejected_shorter$cibil_score and rejected_longer$cibil_score
## t = -0.03, df = 1564, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.72 7.51
## sample estimates:
## mean of x mean of y
## 429 429
t.test(cleaned_shorter_cibil, cleaned_longer_cibil)
##
## Welch Two Sample t-test
##
## data: cleaned_shorter_cibil and cleaned_longer_cibil
## t = -0.05, df = 1563, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.47 7.09
## sample estimates:
## mean of x mean of y
## 427 428
We use this test to confirm whether the CIBIL scores of rejected applicants with shorter loan terms and those with longer loan terms are statistically significantly different.
The overall loan-term frequency distribution shows no apparent outliers, although there are a few in the shorter and longer loan-term groups. That said, even after removing these outliers, no significant difference is found between the two groups’ CIBIL scores.
There is not a significant difference in CIBIL scores between shorter (<= 10 years) and longer (> 10 years) term loans within the rejected group of applicants, because the p-value, 1, is much higher than the standard alpha threshold of 0.05. We therefore fail to reject the null hypothesis that the mean CIBIL scores of shorter- and longer-term rejected applicants are equal.
Here, we are investigating whether there is a correlation between CIBIL score and loan term (years) among Approved applicants. A correlation test is conducted to determine the correlation coefficient and p-value, and a graph is plotted showing the relationship between CIBIL score and loan term.
#correlation coefficient
corr_r <- cor(loan_df[trimws(loan_df$loan_status) == "Approved", "cibil_score"], loan_df[trimws(loan_df$loan_status) == "Approved", "loan_term"], method = "pearson")
#correlation test
cor.test(loan_df[trimws(loan_df$loan_status) == "Approved", "cibil_score"], loan_df[trimws(loan_df$loan_status) == "Approved", "loan_term"], method = "pearson")
##
## Pearson's product-moment correlation
##
## data: loan_df[trimws(loan_df$loan_status) == "Approved", "cibil_score"] and loan_df[trimws(loan_df$loan_status) == "Approved", "loan_term"]
## t = 11, df = 2638, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.173 0.246
## sample estimates:
## cor
## 0.21
#scatter plot
ggplot(subset(loan_df, trimws(loan_status) == "Approved"), aes(x = loan_term, y = cibil_score)) + geom_point(alpha = 0.4, color = "#AAE5D8") + geom_smooth(method = "lm", se = FALSE, color = "#061411") + labs(title = "CIBIL Score vs. Loan Term for Approved Loans", x = "Loan Term (Years)", y = "CIBIL Score")
We use this test to confirm whether there is a correlation between CIBIL Scores and loan terms for the applicants who were Approved for their loan.
The correlation coefficient of 0.21 shows a weak but positive relationship: candidates with higher CIBIL scores have a slightly higher tendency to be approved for longer loan terms. The p-value is very small (< 2e-16), meaning the correlation is unlikely to be due to chance.
Here, we aim to examine whether the person’s employment status affects the approval or rejection of a loan. We begin by visualizing the distribution of approved and rejected applications for each employment group using a bar chart.
ggplot(loan_df, aes(x = self_employed, fill = loan_status)) +
geom_bar(position = "dodge") +
labs(
title = "Loan Approval by self employment status",
x = "self employment status",
y = "Number of Applicants"
) +
scale_fill_manual(values = c(" Approved" = "#F2764E", " Rejected" = "#FAF487")) +
theme_minimal()
Approval rates are nearly identical for self-employed and non-self-employed applicants, suggesting self-employment likely doesn’t affect loan approval. A Chi-square test will confirm this statistically.
contable <- table(loan_df$self_employed, loan_df$loan_status)
chisq.test(contable)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contable
## X-squared = 0.009, df = 1, p-value = 0.9
The Chi-square test (p = 0.9) also shows no significant association between self-employment status and loan approval. This confirms that self-employment does not appear to affect the likelihood of loan approval in this dataset.
Here, we will see whether the number of dependents affects the approval or rejection of a loan. We begin by visualizing the distribution of approved and rejected applications for each dependents group using a bar chart.
ggplot(loan_df, aes(x = no_of_dependents, fill = loan_status)) +
geom_bar(position = "dodge") +
labs(
title = "Loan Approval by Number of Dependents",
x = "Number of Dependents",
y = "Number of Applicants"
) +
scale_fill_manual(values = c(" Approved" = "#0D0942", " Rejected" = "#C4C0F6")) +
theme_minimal()
The bar
chart shows that approval and rejection counts are fairly similar across
all dependent categories, with no noticeable trend suggesting that the
number of dependents strongly influences loan approval outcomes. While
minor variations exist, the approval rate appears consistent across
groups, indicating that loan approval is likely independent of the
number of dependents. A Chi-square test will confirm this
statistically.
contable <- table(loan_df$no_of_dependents, loan_df$loan_status)
chisq.test(contable)
##
## Pearson's Chi-squared test
##
## data: contable
## X-squared = 2, df = 5, p-value = 0.8
The Chi-square test (p = 0.8) also shows no significant
association between no_of_dependents and loan approval.
This confirms that no_of_dependents does not affect the
likelihood of loan approval in this dataset.
The goal is to see whether the loan term affects loan approval. We have already seen the distribution of loan terms, so we now use a box plot to visualize whether certain loan terms have a higher chance of approval than others.
ggplot(loan_df, aes(x = loan_status, y = loan_term, fill = loan_status)) +
geom_boxplot(alpha = 0.6) +
labs(
title = "Loan Term by Loan Approval Status",
x = "Loan Status",
y = "Loan Term (Years)"
) +
scale_fill_manual(values = c(" Approved" = "#98C25F", " Rejected" = "#EB3BA7")) +
theme_minimal()
The boxplot shows that approved loans have a wider range of terms, including very short loans, while rejected loans generally have longer terms. The median loan term for approved loans is 10 years, compared to 12 years for rejected loans, indicating that rejected loans tend to have slightly longer terms. To verify this statistically, we perform a t-test with the following null and alternative hypotheses:
Null Hypothesis (H₀): There is no difference in mean loan term between approved and rejected applicants.
Alternative Hypothesis (H₁): There is a difference in mean loan term between approved and rejected applicants.
# two-sided Welch t-test of loan term by approval status
t.test(loan_term ~ loan_status, data = loan_df)
##
## Welch Two Sample t-test
##
## data: loan_term by loan_status
## t = -8, df = 3637, p-value = 2e-14
## alternative hypothesis: true difference in means between group Approved and group Rejected is not equal to 0
## 95 percent confidence interval:
## -1.69 -1.01
## sample estimates:
## mean in group Approved mean in group Rejected
## 10.4 11.7
The two-sample t-test indicates a significant difference in means (p < 0.05), so we reject the null hypothesis: approved loans have a shorter average term (10.4 years) than rejected loans (11.7 years). The 95% confidence interval for the mean difference is -1.69 to -1.01; since the interval does not contain 0, this confirms that approved loans tend to have shorter terms.
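With roughly 4,000 observations, a significant p-value can accompany a modest practical difference, so a standardized effect size is a useful companion to the t-test. A hedged sketch of Cohen's d on toy vectors (the real computation would split loan_df$loan_term by loan_status):

```r
# Cohen's d for the loan-term gap; these toy vectors stand in for the
# real approved/rejected loan_term groups.
approved_terms <- c(4, 6, 8, 10, 12)
rejected_terms <- c(8, 10, 12, 14, 16)
n1 <- length(approved_terms); n2 <- length(rejected_terms)
pooled_sd <- sqrt(((n1 - 1) * var(approved_terms) +
                   (n2 - 1) * var(rejected_terms)) / (n1 + n2 - 2))
cohens_d <- (mean(approved_terms) - mean(rejected_terms)) / pooled_sd
cohens_d  # negative: approved loans have shorter terms in this toy data
```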
The exploratory data analysis and hypothesis testing conducted in this study provide several key insights into the factors influencing loan approval outcomes. The results indicate that CIBIL score serves as a significant determinant of approval decisions, with approved applicants demonstrating notably higher average scores than rejected ones. This reinforces the importance of an applicant’s creditworthiness in lending assessments. Additionally, the analysis revealed that loan term plays a critical role, as approved loans tend to have shorter durations, suggesting that lenders may prefer applicants seeking lower-risk, shorter-term loans.
While annual income exhibited a strong positive correlation with the loan amount requested, it did not display a significant direct relationship with loan approval status. This finding implies that although applicants with higher incomes tend to request larger loans, income alone may not substantially affect the likelihood of approval once other financial indicators are considered. Thus, CIBIL score and loan term emerge as the most impactful variables for understanding and predicting loan approval decisions, whereas annual income may serve as a supporting predictor variable rather than a primary determinant.
In contrast, variables such as assets value, loan amount requested, and employment status did not exhibit a notable impact on loan approval outcomes during this exploratory phase. While these features may still add some background information, they appear to play a limited role in influencing the approval decision compared to credit and term-related variables.
Future work should focus on developing predictive models to validate these findings and quantify the relative influence of key variables. Variables such as CIBIL score, loan term, and annual income can be incorporated into classification models such as logistic regression to predict loan approval outcomes.
Overall, this study establishes a foundational understanding of the most influential factors affecting loan approval and provides a data-driven basis for future modeling in credit risk evaluation.
data_kmeans <- loan_df[, c("income_annum", "loan_amount", "cibil_score", "residential_assets_value", "commercial_assets_value", "luxury_assets_value", "bank_asset_value")]
cor(data_kmeans)
## income_annum loan_amount cibil_score
## income_annum 1.0000 0.9271 -0.0235
## loan_amount 0.9271 1.0000 -0.0175
## cibil_score -0.0235 -0.0175 1.0000
## residential_assets_value 0.6363 0.5940 -0.0184
## commercial_assets_value 0.6389 0.6016 -0.0053
## luxury_assets_value 0.9287 0.8599 -0.0294
## bank_asset_value 0.8502 0.7871 -0.0154
## residential_assets_value commercial_assets_value
## income_annum 0.6363 0.6389
## loan_amount 0.5940 0.6016
## cibil_score -0.0184 -0.0053
## residential_assets_value 1.0000 0.4146
## commercial_assets_value 0.4146 1.0000
## luxury_assets_value 0.5903 0.5893
## bank_asset_value 0.5263 0.5468
## luxury_assets_value bank_asset_value
## income_annum 0.9287 0.8502
## loan_amount 0.8599 0.7871
## cibil_score -0.0294 -0.0154
## residential_assets_value 0.5903 0.5263
## commercial_assets_value 0.5893 0.5468
## luxury_assets_value 1.0000 0.7876
## bank_asset_value 0.7876 1.0000
The correlation matrix shows that all asset variables are positively and strongly related, so we combined them into a single total asset value to avoid redundancy.
str(loan_df)
## 'data.frame': 4241 obs. of 12 variables:
## $ no_of_dependents : Factor w/ 6 levels "0","1","2","3",..: 3 1 4 4 6 1 6 3 1 6 ...
## $ education : Factor w/ 2 levels " Graduate"," Not Graduate": 1 2 1 1 2 1 1 1 1 2 ...
## $ self_employed : Factor w/ 2 levels " No"," Yes": 1 2 1 1 2 2 1 2 2 1 ...
## $ income_annum : int 9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
## $ loan_amount : int 29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
## $ loan_term : int 12 8 20 8 20 10 4 20 20 10 ...
## $ cibil_score : int 778 417 506 467 382 319 678 382 782 388 ...
## $ residential_assets_value: int 2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
## $ commercial_assets_value : int 17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
## $ luxury_assets_value : int 22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
## $ bank_asset_value : int 8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
## $ loan_status : Factor w/ 2 levels " Approved"," Rejected": 1 2 2 2 2 2 1 2 1 2 ...
loan_df$combined_asset_value <- rowSums(
loan_df[, c("residential_assets_value",
"commercial_assets_value",
"luxury_assets_value",
"bank_asset_value")],
na.rm = TRUE
)
data_kmeans <- loan_df[, c("loan_amount", "combined_asset_value", "cibil_score", "income_annum")]
str(data_kmeans)
## 'data.frame': 4241 obs. of 4 variables:
## $ loan_amount : int 29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
## $ combined_asset_value: num 50700000 17000000 57700000 52700000 55000000 33900000 70800000 36700000 5500000 9500000 ...
## $ cibil_score : int 778 417 506 467 382 319 678 382 782 388 ...
## $ income_annum : int 9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
data_kmeans_scaled <- scale(data_kmeans)
wss <- sapply(1:10, function(k){
kmeans(data_kmeans_scaled, k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b",
xlab = "Number of clusters (k)",
ylab = "Total within-cluster sum of squares")
The elbow plot indicates a distinct inflection point at k = 2, which supports selecting two clusters for the analysis.
set.seed(123)
k <- 2
km <- kmeans(data_kmeans_scaled, centers = k, nstart = 25)
library(cluster)
sil <- silhouette(km$cluster, dist(data_kmeans_scaled))
mean(sil[, 3])
## [1] 0.423
The silhouette score of 0.423 indicates that the two borrower clusters are reasonably well-separated, meaning the clustering has captured meaningful differences in income, assets, loan amount, and CIBIL score. While the groups are not perfectly distinct, the separation is strong enough to provide useful insights into different borrower profiles.
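The overall mean can hide asymmetry between the two groups, so per-cluster average silhouette widths are a quick additional check. A sketch on synthetic, well-separated data (assuming the cluster package, as above):

```r
library(cluster)  # silhouette(); a recommended package shipped with R
set.seed(1)
# Two well-separated synthetic groups stand in for data_kmeans_scaled
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
km_toy <- kmeans(toy, centers = 2, nstart = 10)
sil_toy <- silhouette(km_toy$cluster, dist(toy))
# Average silhouette width per cluster (column 3 holds the widths)
tapply(sil_toy[, 3], sil_toy[, 1], mean)
```

On the real model, `tapply(sil[, 3], sil[, 1], mean)` reports whether one borrower cluster is markedly less cohesive than the other.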
Attaching cluster labels to both the scaled and original datasets, then calculating the average unscaled feature values for each cluster to create interpretable cluster profiles.
# build a scaled data frame and attach cluster labels
df_scaled <- as.data.frame(data_kmeans_scaled)
df_scaled$cluster <- factor(km$cluster)
# -----------------------
# 1. Add cluster labels to ORIGINAL data
# -----------------------
df_unscaled <- data_kmeans # original (unscaled) data
df_unscaled$cluster <- factor(km$cluster)
# -----------------------
# 2. Compute cluster profiles using unscaled values
# -----------------------
library(dplyr)  # provides %>%, group_by(), summarise()
cluster_profiles_unscaled <- df_unscaled %>%
group_by(cluster) %>%
summarise(
loan_amount = mean(loan_amount),
cibil_score = mean(cibil_score),
income_annum = mean(income_annum),
combined_asset_value = mean(combined_asset_value)
)
print(cluster_profiles_unscaled)
## # A tibble: 2 × 5
## cluster loan_amount cibil_score income_annum combined_asset_value
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 22662709. 594. 7510627. 48803207.
## 2 2 7913197. 606. 2709201. 17009944.
Cluster 1: High-value borrowers: This cluster represents financially strong borrowers who request very large loans and possess significant asset holdings and high annual income. Despite having slightly lower CIBIL scores on average, their high net worth and strong cash flow make them lower risk from a collateral perspective.
Cluster 2: Moderate-value borrowers: This cluster consists of moderate-income, moderate-asset borrowers who apply for smaller loans and have slightly better credit scores. Their financial profile is less substantial than Cluster 1, but their stronger CIBIL scores might indicate better repayment discipline.
These differences suggest that the clustering effectively separates borrowers based on financial strength and borrowing patterns.
library(caret)
library(e1071)
# Combine original features with cluster labels
cluster_data <- df_unscaled # df_unscaled already includes: loan_amount, combined_asset_value, cibil_score, income_annum, cluster
set.seed(123)
# 80/20 train-test split (stratified by cluster)
index <- createDataPartition(cluster_data$cluster, p = 0.8, list = FALSE)
train_data <- cluster_data[index, ]
test_data <- cluster_data[-index, ]
table(cluster_data$cluster) # Check class balance
##
## 1 2
## 2089 2152
svm_model <- svm(
cluster ~ loan_amount + combined_asset_value + cibil_score + income_annum,
data = train_data,
kernel = "radial",
class.weights = table(cluster_data$cluster) / nrow(cluster_data)
)
predictions <- predict(svm_model, newdata = test_data)
confusion_matrix <- table(predictions, test_data$cluster)
print(confusion_matrix)
##
## predictions 1 2
## 1 415 1
## 2 2 429
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.996458087367178"
We trained a Support Vector Machine to classify observations into the two k-means clusters. The model achieved an accuracy of 99.65%, with only 3 misclassifications out of 847 test observations. The confusion matrix shows very strong separation between the clusters, indicating that the cluster structure found by k-means is highly stable and predictable from the input features. This validates that the clusters represent genuinely distinct segments rather than random partitions.
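Accuracy alone can mask class-specific errors, so per-cluster precision and recall are worth recomputing directly from the confusion matrix reported above:

```r
# Per-cluster precision and recall recomputed from the confusion matrix
# above (rows = predictions, columns = actual clusters)
cm <- matrix(c(415, 2, 1, 429), nrow = 2,
             dimnames = list(pred = c("1", "2"), actual = c("1", "2")))
precision <- diag(cm) / rowSums(cm)  # correct predictions per predicted class
recall    <- diag(cm) / colSums(cm)  # correct predictions per actual class
round(rbind(precision, recall), 4)
```

Both metrics exceed 0.99 for both clusters, consistent with the near-perfect accuracy.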
library(fpc)
# calculate CH index
ch <- calinhara(data_kmeans_scaled, km$cluster, cn = 2)
cat("Calinski-Harabasz Index:", ch, "\n")
## Calinski-Harabasz Index: 4531
# compare CH index under different k values
ch_values <- sapply(1:10, function(k) {
  km_temp <- kmeans(data_kmeans_scaled, k, nstart = 20)
  calinhara(data_kmeans_scaled, km_temp$cluster, cn = k)  # use this k's clustering
})
plot(1:10, ch_values, type = "b", xlab = "k", ylab = "CH Index", main = "CH Index under different K values")
We calculate the Calinski-Harabasz (CH) index for this k-means model. The principle is the ratio of between-cluster variance to within-cluster variance: the larger the value, the greater the separation between clusters and the more compact each cluster is internally. The result here is 4531, a large value, indicating that the clustering result is quite good. We then compared the CH index under different k values. As the graph shows, the CH index is largest at k = 2, which again confirms the elbow method's result that two clusters is the most suitable choice.
library(fpc)
set.seed(123)
stab <- clusterboot(data_kmeans_scaled,
clustermethod = kmeansCBI,
k = 2,
runs = 100,
seed = 123)
## boot 1
## ...
## boot 100
print(stab$bootmean)
## [1] 0.994 0.995
print(stab$bootbrd)
## [1] 0 0
In the bootstrap validation, this model achieved excellent results, with bootmean = 0.994, 0.995 and bootbrd = 0, 0. This indicates that the k-means model is extremely stable. The bootmean values are very close to 1, meaning the two clusters are consistently reproduced in over 99% of the bootstrap samples, with almost no "false clusters caused by sampling randomness". bootbrd counts how many of the 100 bootstrap runs each cluster dissolved in; the smaller the value, the more stable the cluster, and here both clusters reached the theoretical minimum of 0, never dissolving. These two clusters are genuine structures in the data, not the result of arbitrary division by the clustering algorithm.
For a better understanding, this is the result of PCA dimensionality reduction visualization.
cluster_factor <- factor(km$cluster, levels = c(1, 2), labels = c("Cluster 1", "Cluster 2"))
library(FactoMineR)
library(factoextra)
library(ggplot2)
pca <- PCA(data_kmeans_scaled, graph = FALSE)
fviz_pca_ind(
pca,
geom.ind = "point",
col.ind = cluster_factor,
palette = c("#2E9FDF", "#E7B800"),
legend.title = "Cluster",
title = "PCA Visualization Result",
xlab = "PC1",
ylab = "PC2",
repel = TRUE
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
The two clusters are almost completely separated along the PC1 axis, with only a very small amount of overlap near PC1 = 0.
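How literally the 2-D plot should be read depends on how much of the total variance PC1 and PC2 actually carry. A sketch using base prcomp on toy scaled data (for the real figure, the eigenvalues in pca$eig from FactoMineR report the same quantities):

```r
# Variance explained by each principal component, sketched with base
# prcomp on toy scaled data standing in for data_kmeans_scaled.
set.seed(42)
toy <- scale(matrix(rnorm(200), ncol = 4))  # 50 rows x 4 features
pca_toy <- prcomp(toy)
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
round(cumsum(var_explained), 3)  # cumulative share captured by PC1..PC4
```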
library(GGally)
# use data.frame, not cbind: cbind would coerce the factor to numeric codes
plot_data <- data.frame(data_kmeans_scaled, cluster = factor(km$cluster))
ggparcoord(plot_data,
columns = 1:4,
groupColumn = "cluster",
scale = "globalminmax",
alphaLines = 0.5) +
labs(title = "Cluster distribution of each feature") +
theme_minimal()
This parallel coordinate plot illustrates how the two clusters differ across the four standardized features, with Cluster 1 in dark tones and Cluster 2 in light tones. First, the dark and light regions hardly overlap, indicating that the feature trajectories of the two clusters are clearly different. Second, the plot visually confirms the cluster profiles identified earlier: the lines of the high-value cluster concentrate in the region of high loan/asset/income and lower CIBIL score, while the lines of the moderate-value cluster concentrate in the region of low loan/asset/income and higher CIBIL score.
Additional t-tests or ANOVA comparisons between the clusters would add little value here: k-means has already partitioned the observations on these same variables, so such tests would come out significant by construction.
The three variables we chose are CIBIL score, loan amount, and loan term. These three are important because they are vital for credit risk assessment.
Null Hypothesis (H0): All three models are equally effective indicators, or the simplest model is best once the penalty for adding more variables is factored in (AIC1 ≈ AIC2 ≈ AIC3).
Alternative Hypothesis (HA): The model with the minimum AIC is the best indicator, demonstrating a statistically superior balance of fit and model simplicity compared to the other two.
First let us run Chi-squared tests to determine dependence of variables.
chisq.test(loan_df$loan_status, loan_df$cibil_score)
##
## Pearson's Chi-squared test
##
## data: loan_df$loan_status and loan_df$cibil_score
## X-squared = 3597, df = 600, p-value <2e-16
chisq.test(loan_df$loan_status, loan_df$loan_amount)
##
## Pearson's Chi-squared test
##
## data: loan_df$loan_status and loan_df$loan_amount
## X-squared = 344, df = 377, p-value = 0.9
chisq.test(loan_df$loan_status, loan_df$loan_term)
##
## Pearson's Chi-squared test
##
## data: loan_df$loan_status and loan_df$loan_term
## X-squared = 150, df = 9, p-value <2e-16
These tests assess whether two variables are significantly associated. The Chi-squared tests on the three variables, CIBIL score, loan amount, and loan term, show that CIBIL score and loan term each have a p-value below the standard significance level of .05 (both < 2e-16). We therefore reject the null hypotheses that CIBIL score and loan term are independent of loan approval: both are significantly associated with loan status. For loan amount, we fail to reject the null hypothesis, as its p-value (0.9) is greater than .05, meaning loan amount and loan approval are independent of each other. Note that cibil_score and loan_amount are continuous, so treating each distinct value as its own category makes these particular Chi-squared approximations rough; the logistic regression models below provide a more appropriate check.
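Because chi-square p-values grow with sample size, an effect-size measure such as Cramér's V helps separate strong associations from merely detectable ones. A sketch on a small hypothetical 2x2 table (counts invented for illustration):

```r
# Cramér's V effect size for a chi-square association, on a small
# hypothetical 2x2 table; 0 = no association, 1 = perfect association.
tab <- matrix(c(30, 10, 10, 30), nrow = 2)
chi <- chisq.test(tab, correct = FALSE)
cramers_v <- unname(sqrt(chi$statistic / (sum(tab) * (min(dim(tab)) - 1))))
cramers_v
```

The same formula applied to `table(loan_df$loan_status, loan_df$loan_term)` would quantify how strong the loan-term association actually is.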
library(regclass)
library(ResourceSelection)
LogitM1 <- glm(loan_status ~ cibil_score, data = loan_df, family = "binomial")
summary(LogitM1)
##
## Call:
## glm(formula = loan_status ~ cibil_score, family = "binomial",
## data = loan_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.452352 0.373349 30.7 <2e-16 ***
## cibil_score -0.021735 0.000698 -31.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5622.1 on 4240 degrees of freedom
## Residual deviance: 2122.0 on 4239 degrees of freedom
## AIC: 2126
##
## Number of Fisher Scoring iterations: 7
#evaluation step - confidence interval
confint.default(LogitM1)
## 2.5 % 97.5 %
## (Intercept) 10.7206 12.1841
## cibil_score -0.0231 -0.0204
#evaluation step - confusion matrix
confusion_matrix(LogitM1)
## Predicted Approved Predicted Rejected Total
## Actual Approved 2473 167 2640
## Actual Rejected 182 1419 1601
## Total 2655 1586 4241
#evaluation step - Hosmer and Lemeshow
hoslem.test(as.numeric(loan_df$loan_status) - 1, fitted(LogitM1))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(loan_df$loan_status) - 1, fitted(LogitM1)
## X-squared = 623, df = 8, p-value <2e-16
#evaluation step - McFadden R^2
null_tLogit <- glm(loan_status ~ 1, data = loan_df, family = "binomial")
mcFadden = 1 - logLik(LogitM1) / logLik(null_tLogit)
cat("McFadden R-squared: ", format(mcFadden, digits=3), "\n")
## McFadden R-squared: 0.623
LogitM2 <- glm(loan_status ~ cibil_score + loan_term, data = loan_df, family = "binomial")
summary(LogitM2)
##
## Call:
## glm(formula = loan_status ~ cibil_score + loan_term, family = "binomial",
## data = loan_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.136427 0.396392 28.1 <2e-16 ***
## cibil_score -0.024212 0.000811 -29.9 <2e-16 ***
## loan_term 0.148309 0.011223 13.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5622.1 on 4240 degrees of freedom
## Residual deviance: 1919.0 on 4238 degrees of freedom
## AIC: 1925
##
## Number of Fisher Scoring iterations: 7
#evaluation step - confidence interval
confint.default(LogitM2)
## 2.5 % 97.5 %
## (Intercept) 10.3595 11.9133
## cibil_score -0.0258 -0.0226
## loan_term 0.1263 0.1703
#evaluation step - confusion matrix
confusion_matrix(LogitM2)
## Predicted Approved Predicted Rejected Total
## Actual Approved 2466 174 2640
## Actual Rejected 183 1418 1601
## Total 2649 1592 4241
#evaluation step - Hosmer and Lemeshow
hoslem.test(as.numeric(loan_df$loan_status) - 1, fitted(LogitM2))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(loan_df$loan_status) - 1, fitted(LogitM2)
## X-squared = 182, df = 8, p-value <2e-16
#evaluation step - McFadden R^2
null_tLogit <- glm(loan_status ~ 1, data = loan_df, family = "binomial")
mcFadden = 1 - logLik(LogitM2) / logLik(null_tLogit)
cat("McFadden R-squared: ", format(mcFadden, digits=3), "\n")
## McFadden R-squared: 0.659
LogitM3 <- glm(loan_status ~ cibil_score + loan_term + loan_amount, data = loan_df, family = "binomial")
summary(LogitM3)
##
## Call:
## glm(formula = loan_status ~ cibil_score + loan_term + loan_amount,
## family = "binomial", data = loan_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.14e+01 4.17e-01 27.36 <2e-16 ***
## cibil_score -2.43e-02 8.14e-04 -29.82 <2e-16 ***
## loan_term 1.49e-01 1.12e-02 13.22 <2e-16 ***
## loan_amount -1.53e-08 6.43e-09 -2.38 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5622.1 on 4240 degrees of freedom
## Residual deviance: 1913.3 on 4237 degrees of freedom
## AIC: 1921
##
## Number of Fisher Scoring iterations: 7
#evaluation step - confidence interval
confint.default(LogitM3)
## 2.5 % 97.5 %
## (Intercept) 1.06e+01 1.22e+01
## cibil_score -2.59e-02 -2.27e-02
## loan_term 1.26e-01 1.71e-01
## loan_amount -2.79e-08 -2.72e-09
#evaluation step - confusion matrix
confusion_matrix(LogitM3)
## Predicted Approved Predicted Rejected Total
## Actual Approved 2466 174 2640
## Actual Rejected 181 1420 1601
## Total 2647 1594 4241
#evaluation step - Hosmer and Lemeshow
hoslem.test(as.numeric(loan_df$loan_status) - 1, fitted(LogitM3))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(loan_df$loan_status) - 1, fitted(LogitM3)
## X-squared = 181, df = 8, p-value <2e-16
#evaluation step - McFadden R^2
null_tLogit <- glm(loan_status ~ 1, data = loan_df, family = "binomial")
mcFadden = 1 - logLik(LogitM3) / logLik(null_tLogit)
cat("McFadden R-squared: ", format(mcFadden, digits=3), "\n")
## McFadden R-squared: 0.66
All three models are highly significant predictors of loan approval, as all coefficients have p-values below the standard significance level of .05. (Since glm models the probability of the second factor level, Rejected, the negative cibil_score coefficient means higher scores reduce the probability of rejection.) Model 1 uses CIBIL score only. It is highly significant and explains a substantial share of the deviance, with a McFadden R^2 of 0.623, but its high AIC of 2126 makes it the least efficient model. Model 2 adds loan term to Model 1 and shows an improvement in fit: the AIC drops to 1925 and the McFadden R^2 rises to 0.659. This model confirms that both CIBIL score and loan term are crucial, highly significant predictors, and together they capture the majority of the available predictive power for loan approval. Model 3 adds loan amount to Model 2. It achieves the best fit, with the lowest AIC (1921) and the highest McFadden R^2 (0.66), and all three variables are statistically significant; the p-value for loan_amount is 0.017. We therefore reject the null hypothesis of a non-significant variable for each predictor: every variable contributes statistically unique and valuable information to the model's fit.
To answer the question posed, we reject the null hypothesis and conclude that Model 3, with the three predictors CIBIL score, loan term, and loan amount, is the best model: its improvement in fit outweighs the penalty for adding more parameters. Since Model 3 has the highest McFadden R^2, the lowest AIC, and all variables significant, it accounts for a slightly greater proportion of the variability in loan approval status than the other models.
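The AIC comparison can also be framed as formal likelihood-ratio tests between the nested models. A sketch on simulated data of the same shape (variable names here are stand-ins, not the real dataset):

```r
# AIC comparison plus likelihood-ratio test for nested logistic models,
# on simulated data; 'score', 'term', 'amount' are hypothetical stand-ins.
set.seed(123)
n <- 500
score  <- rnorm(n)
term   <- rnorm(n)
amount <- rnorm(n)
# true model uses score and term only; amount has no real effect
y <- rbinom(n, 1, plogis(-1.5 * score + 1.0 * term))
m1 <- glm(y ~ score, family = "binomial")
m2 <- glm(y ~ score + term, family = "binomial")
m3 <- glm(y ~ score + term + amount, family = "binomial")
AIC(m1, m2, m3)                # lower AIC = better fit/complexity trade-off
anova(m1, m2, test = "Chisq")  # LRT: does adding term significantly help?
```

Applied to LogitM1 through LogitM3, `AIC()` and `anova(..., test = "Chisq")` give a direct test of whether each added predictor earns its extra parameter.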